Figure 1: Although all image patches on the left are just noise, when we show thousands of them to online workers and ask
them to find ones that look like cars, suddenly a car emerges in the average, shown on the right. This noise-driven method is
based on well-known tools in human psychophysics that estimate the decision boundary that the human visual system uses
for recognition. In this paper, we explore how classifiers acquired from human imagination can be transferred into a machine.


Abstract 

The human mind can remarkably imagine objects that it has never seen, touched, or heard, all in vivid detail.
Motivated by the desire to harness this rich source of information from the human mind, this paper investigates
how to extract classifiers from the human visual system and leverage them in a machine. We introduce a method
that, inspired by well-known tools in human psychophysics, estimates the classifier that the human visual system
might use for recognition, but in computer vision feature spaces. Our experiments are surprising, and suggest
that classifiers from the human visual system can be transferred into a machine with some success. Since these
classifiers seem to capture favorable biases in the human visual system, we present a novel SVM formulation that
constrains the orientation of the SVM hyperplane to agree with the human visual system. Our results suggest that
transferring this human bias into machines can help object recognition systems generalize across datasets.
Moreover, we found that people's culture may subtly vary the objects that people imagine, which influences this
bias. Overall, human imagination can be an interesting resource for future visual recognition systems.

1.  Introduction 

"Logic will get you from A to Z; imagination will get you everywhere." (Albert Einstein)

Computers routinely beat the human brain on challenges with logic and calculation speed. But, when it comes to
object recognition, humans are still the state-of-the-art. What is the key difference between human recognition
and machine recognition?

One answer is that the best object recognition systems today are unable to imagine objects that they have never
encountered. The human mind, however, can effortlessly imagine objects that it has never seen, touched, or heard.
Even more remarkably, humans can do this for any color, orientation, or deformation, with the object upside down,
in or out of context, all in vivid detail.

In this paper, we seek to transfer the mental images of what a human can imagine into an object recognition
system. We combine the strengths of two approaches: state-of-the-art features in computer vision [7, 23] with a
method in human psychophysics [2] that estimates the decision boundary that the human visual system uses for
recognition.

Consider what may seem like an odd experiment: we sample white noise in a visual feature space from a standard
normal distribution. What is the chance that this sample is a car? Fig. 1a visualizes some samples using feature
inversion [38] and, as expected, we see noise. But, let us not stop there. We next generate one hundred thousand
points from the same distribution, and ask workers on Amazon Mechanical Turk to classify each sample as a car or
not. Fig. 1c shows the average of visual features that workers believed were cars. Although our dataset consists
of only white noise, a car emerges!

While sampling noise may seem unusual to computer vision researchers, a similar procedure, named classification
images, has gained popularity in human psychophysics [2] for estimating the template the human visual system
internally uses for recognition [20, 3]. In the procedure, an observer looks at random noise and indicates
whether they perceive a target category. After a large number of trials, psychophysics researchers can apply
basic statistics to extract the internal template the observer used for recognition. We discovered that a similar
approach can be used to build a coarse object recognition system that originated from people's imagination.

Motivated by the observation that the human visual system is a rich source of information, this paper
investigates the scientific question of whether visual classifiers acquired from human imagination can be
leveraged computationally. Inspired by classification images, we introduce a method to estimate imaginary
classifiers from the human mind, but in a feature space that is compact and discriminative for computers. To our
knowledge, we are the first to extract classifiers from the human visual system in computer vision feature
spaces. We then present a novel SVM formulation that integrates knowledge from the human visual system by
constraining the SVM solution to be oriented close to the imaginary classifier. Our experiments are surprising,
and suggest that classifiers from the human mind might be transferable into a machine.

In addition, we found that imaginary classifiers are useful in two particular computer vision applications.
Firstly, since these classifiers do not depend on real images, we can build recognition systems in situations
where it is difficult to collect data. Our results suggest that it is possible to recognize objects in images in
the wild without training on any real images. Secondly, since imaginary classifiers are estimated only by humans
looking at noise, they inherit biases from the human visual system. Our experiments suggest that the bias from
the human visual system is favorable, and can improve generalization performance across datasets. Overall, these
experiments hint that human imagination can be an interesting resource for future visual recognition systems.

1.1.  Related  Work 

This paper acquires a recognition system from the human mind by combining several popular methods. While each
individual method is standard, their combination is novel. We briefly review the related work in both human and
computer vision.

Mental Images: Our methods build upon work to extract mental images from a user's head for both general objects
[16] and faces [26]. However, our work differs because we estimate mental images in state-of-the-art computer
vision feature spaces, which allows us to integrate the mental images into an object recognition system.


Human-in-the-Loop: The idea to transfer classifiers from the human mind into object recognition is inspired by
many recent works that put a human in the computer vision loop [5, 10, 29], train recognition systems with
active learning [36, 34], and study crowdsourcing [37, 32, 40]. The primary difference between these approaches
and our work is that, rather than using crowds as a workforce, we want to extract classifiers from the workers'
minds using methods rooted in human psychophysics.

Transfer Learning: We build upon methods in transfer learning to incorporate priors into learning algorithms. A
common transfer learning method for SVMs is to change the max-margin regularization term $\|w\|_2^2$ to
$\|w - c\|_2^2$, where $c$ is the prior [31]. However, this imposes a prior on the norm of $w$. In our case,
since the imaginary classifier does not provide an additional prior on the norm, we introduce a novel SVM
formulation that constrains only the orientation of $w$ to be close to $c$. Our approach extends sign
constraints on SVMs [12], but instead enforces orientation constraints.

Deep Learning: There is a large body of work studying deep learning [24, 23], which hopes to build models that
mimic neuron activations in the human brain. While our work is also inspired by biological vision, we are only
interested in estimating the classifier parameters for very specific recognition tasks.

Human Psychophysics: Finally, our ideas extend classification images [20, 3], a tool in psychophysics to
estimate decision boundaries that the human visual system uses. Firstly, while classification images have been
mostly restricted to images and audio, we are the first, to our knowledge, to apply them to feature spaces in
computer vision. Secondly, our approach uses only noise to estimate classifiers. Unlike classification images,
we do not use any real images. We capitalize on the ability of people to discern visual objects from random
noise in a systematic manner [17].

2.  Acquiring  Classifiers 

In this section, we describe how to acquire classifiers from the human visual system. We first review a popular
method in human psychophysics for performing this task. Then, we adapt it for use in a computer.

2.1.  Classification  Images 

We first review classification images, a popular method in human psychophysics that estimates the internal
template that the human visual system uses for recognition [20, 3]. The goal is to approximate the template
$c \in \mathbb{R}^d$ that the human visual system uses for recognition.

The intuition behind classification images is simple, but powerful. We wish to discover how a human observer
discriminates between two classes A and B, e.g. male vs. female faces, or chair vs. not chair. Suppose we have
real images $a \in A \subset \mathbb{R}^d$ and $b \in B \subset \mathbb{R}^d$. If we sample white noise
$\epsilon \sim \mathcal{N}(0_d, I_d)$ and ask an observer to indicate the class label for $a + \epsilon$, most of
the time the observer will answer class A. However, there is a chance that $\epsilon$ might manipulate $a$ to
cause the observer to mistakenly label $a + \epsilon$ as class B.

The key insight into classification images is that, if we perform a large number of trials, then we can
estimate a decision function $f(\cdot)$ that discriminates between A and B, but makes the same mistakes as the
observer. Since $f(\cdot)$ makes the same errors, it provides a good model for the internal decision function
that the observer uses. By analyzing this model, we can then gain insight into how the human visual system
discriminates between A and B.

If we assume that the human visual system uses a linear decision boundary of the form $f(x; c) = c^T x$, then
[27] shows that the classification image $c$ with the optimal signal-to-noise ratio is:

$$c = (\mu_{AA} + \mu_{BA}) - (\mu_{AB} + \mu_{BB}) \qquad (1)$$

where $\mu_{PQ} \in \mathbb{R}^d$ is the average image over trials where the original was class $P$ and the
observer predicted $Q$.
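
As a rough illustration of Eqn. 1, the sketch below groups trials by (original class, observer response), averages each group, and combines the four averages as $(\mu_{AA} + \mu_{BA}) - (\mu_{AB} + \mu_{BB})$. The function names and the trial format are assumptions made for this example only, and the four toy trials are invented so that the arithmetic is easy to check by hand.

```python
from collections import defaultdict


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def classification_image(trials):
    """Combine per-group averages as (mu_AA + mu_BA) - (mu_AB + mu_BB).

    `trials` is an iterable of (original_class, response, stimulus) tuples,
    with classes given as the strings "A" or "B" (a hypothetical format).
    """
    groups = defaultdict(list)
    for original, response, stimulus in trials:
        groups[(original, response)].append(stimulus)

    needed = [("A", "A"), ("B", "A"), ("A", "B"), ("B", "B")]
    missing = [key for key in needed if key not in groups]
    if missing:
        raise ValueError(f"no trials for groups: {missing}")

    mu = {key: mean_vector(groups[key]) for key in needed}
    return [
        aa + ba - ab - bb
        for aa, ba, ab, bb in zip(
            mu[("A", "A")], mu[("B", "A")], mu[("A", "B")], mu[("B", "B")]
        )
    ]


# Toy trials: one stimulus per group, chosen only to make the sums obvious.
trials = [
    ("A", "A", [2.0, 0.0]),
    ("A", "B", [0.0, 1.0]),
    ("B", "A", [1.0, 0.0]),
    ("B", "B", [0.0, 3.0]),
]
c = classification_image(trials)
print(c)  # [3.0, -4.0]
```

Both terms in the first parenthesis come from trials the observer answered "A", and both in the second from trials answered "B", so the result points from what looked like B towards what looked like A.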

Is it reasonable to assume that classification images should be linear? Although there is overwhelming evidence
that object recognition in the human brain is nonlinear, a linear classification image is reasonable because we
only seek an approximation of the human decision boundary. Moreover, while nonlinear models are possible, they
require significantly more trials to estimate, which is expensive, and in practice we see good results with
linear models. We do, however, wish to point out promising efforts that study nonlinear classification images
[28].

2.2.  Imaginary  Classifiers 

Since psychophysics researchers are interested in understanding how the human brain functions, they want to
extract classifiers from the human visual system that are interpretable. Consequently, they build classification
images in pixel space for geometric shapes or faces. However, we are interested in extracting classifiers to use
in a computer. Inspired by classification images, we present an approach that acquires classifiers from the
human visual system, but in the same feature spaces as computer vision systems. Our new method, which we refer
to as imaginary classifiers, uses two key ideas.

Firstly, we capitalize on recent work in feature inversion [38, 39, 8, 21]. Rather than generating noise in
pixel space, we generate noise in feature space. We then invert the noise features back to an image and ask
humans to label the feature visualization. Since machines understand features and humans understand
visualizations, we are able to build a classifier that makes similar recognition mistakes as humans, but in a
space that is discriminative for computers.


Secondly, we found that humans are surprisingly good at imagining objects in visualizations of feature space
noise. When we instruct people to label visualizations of just white Gaussian noise (with no real images),
people frequently find white noise that looks like objects. Feature descriptors often have structure (e.g.,
encodings of gradients or colors) that likely causes white noise in feature space to invert to images that look
like objects. Although people are incorrect when they label pure noise as an object, they are providing
information about how the human visual system discriminates objects in computer vision feature spaces.

We propose to build imaginary classifiers by combining feature inversion with people's ability to discern
objects in pure noise. We first sample noise from a zero-mean, unit-covariance Gaussian distribution
$x \sim \mathcal{N}(0_d, I_d)$. We then invert the noise feature $x$ back to an image $\phi^{-1}(x)$, where
$\phi^{-1}(\cdot)$ is the feature inverse. By instructing people to indicate whether a visualization of noise is
a target category or not, we can build a linear classifier $c \in \mathbb{R}^d$ that approximates the decisions
of their visual system:

$$c = \mu_A - \mu_B \qquad (2)$$

where $\mu_A \in \mathbb{R}^d$ is the average, in feature space, of white noise that workers incorrectly
believed was an actual object, and similarly $\mu_B \in \mathbb{R}^d$ is the average of noise that workers
correctly labeled as noise. Since we sample white Gaussian noise, Eqn. 2 can be interpreted as an LDA classifier
[18] over labeled noise where the covariance is identity, $\Sigma = I$.^1 Moreover, observe that Eqn. 2 is a
special case of the original human psychophysics Eqn. 1 where the background class B is white noise and the
positive class A is empty. Instead, we rely on humans to hallucinate objects in noise to form $\mu_A$.
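
To make Eqn. 2 concrete, the following sketch replaces the human workers with a simulated observer: a hidden unit-norm template that answers "object" whenever its response to a noise sample exceeds a threshold. The template, the threshold of 1.0, the dimensionality, and the trial count are all assumptions chosen for illustration and are not values from our experiments. The point is only that averaging pure white noise by label recovers the direction of the observer's template.

```python
import math
import random


def imaginary_classifier(noise, labels):
    """Return c = mu_A - mu_B, where mu_A averages the noise samples labeled
    as the target category and mu_B averages the remaining samples."""
    positives = [x for x, is_object in zip(noise, labels) if is_object]
    negatives = [x for x, is_object in zip(noise, labels) if not is_object]
    if not positives or not negatives:
        raise ValueError("need at least one sample with each label")
    mu_a = [sum(col) / len(positives) for col in zip(*positives)]
    mu_b = [sum(col) / len(negatives) for col in zip(*negatives)]
    return [a - b for a, b in zip(mu_a, mu_b)]


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


rng = random.Random(0)
dims, n_trials = 16, 20000

# Hypothetical stand-in for a human observer: a hidden unit-norm template.
hidden = [rng.gauss(0.0, 1.0) for _ in range(dims)]
hidden_norm = math.sqrt(sum(h * h for h in hidden))
hidden = [h / hidden_norm for h in hidden]

# White Gaussian noise in feature space, x ~ N(0, I).
noise = [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(n_trials)]

# The simulated observer "sees" the object when the template response is high.
labels = [sum(h * x for h, x in zip(hidden, sample)) > 1.0 for sample in noise]

c = imaginary_classifier(noise, labels)
similarity = cosine(c, hidden)
print(f"cosine between estimate and hidden template: {similarity:.3f}")
```

A real observer is of course noisier and not linear, so this only illustrates why the difference of means is a sensible estimator under the linear assumption.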

Since we average noise in feature space instead of pixel space, we have two advantages over standard
classification images. Firstly, imaginary classifiers are in a feature space that is compact and discriminative,
which allows us to plug the classifier into a machine. Secondly, since we build imaginary classifiers with only
white Gaussian noise and no real images, our approach is immune to many issues in dataset bias [35]. Instead,
imaginary classifiers inherit the biases present in the human visual system, which we suspect provides
advantageous signals about the visual world.

We were able to estimate c with one hundred thousand trials. We picked an aspect ratio appropriate for our
target category, sampled one hundred thousand points in feature space from the standard multivariate normal
distribution, and inverted each sample with HOGgles [38]. We then put each visualization on Amazon Mechanical
Turk [32] and instructed workers to indicate whether they see the target category or not. Since we found that
the interpretation of noise visualizations depends on the scale, we show the worker three different scales. We
paid workers 10¢ to label 100 images, and the workers were fast, often labeling the entire one hundred thousand
images in a few hours. Our experiments were affordable, with each classifier only costing around $100 to build.
In order to assure quality, we occasionally gave workers an easy example to which we knew the answer, and only
retained work from workers who performed well above chance. We only used the easy examples to qualify workers,
and discarded them for computing the final classifier.

^1 We tried training other classifiers too (such as SVM), but we did not see any advantage in our experiments.

Figure 2: We visualize some decision boundaries acquired from the Mechanical Turk workers' minds. Although they
are blurred, in many cases significant detail can be observed. Notice that the car classifier captures a darker
road below the car, and a lighter sky towards the top. The television shows a rectangular structure, the person
mimics a pedestrian, and the valves can be seen in the fire hydrant.
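
The quality-control step described above can be sketched as a simple filter. This is a minimal illustration, not the pipeline we ran: the accuracy threshold of 0.75, the minimum number of easy examples, and the record format are all hypothetical choices, since the text only states that retained workers performed well above chance.

```python
def retain_workers(gold_results, min_accuracy=0.75, min_gold=4):
    """Return the ids of workers whose accuracy on the easy (known-answer)
    examples is at least `min_accuracy`.

    `gold_results` maps a worker id to a list of booleans, one per easy
    example, True when the worker answered it correctly. Both thresholds
    are hypothetical.
    """
    kept = set()
    for worker, outcomes in gold_results.items():
        if len(outcomes) >= min_gold and sum(outcomes) / len(outcomes) >= min_accuracy:
            kept.add(worker)
    return kept


def filter_labels(labels, kept_workers, gold_ids):
    """Keep labels from retained workers and drop the easy examples, which
    are used only to qualify workers and not to compute the classifier."""
    return [
        (worker, sample, is_object)
        for worker, sample, is_object in labels
        if worker in kept_workers and sample not in gold_ids
    ]


# Toy records: (worker id, sample id, "is this the target category?").
gold_results = {
    "w1": [True, True, True, False],    # 75% correct: retained
    "w2": [True, False, False, False],  # 25% correct: rejected
}
labels = [
    ("w1", "noise-001", True),
    ("w1", "gold-001", True),    # easy example: discarded
    ("w2", "noise-002", False),  # rejected worker: discarded
]
kept = retain_workers(gold_results)
clean = filter_labels(labels, kept, gold_ids={"gold-001"})
print(clean)  # [('w1', 'noise-001', True)]
```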

Surprisingly, although subjects are classifying zero-mean, identity covariance white Gaussian noise, objects
suddenly emerge after many trials. We visualize some of the imaginary classifiers in Fig. 2. In many cases, we
can observe significant detail. For example, in the car classifier, we can clearly see a vehicle-like object in
the center, sitting on a dark road beneath a light sky. The television clearly shows a rectangular structure,
and the fire hydrant reveals a red hydrant with two arms on the side.

3.  Experiments  on  Imaginary  Classifiers 

There is a large class of visual objects that humans can imagine, but they have never seen. In order to explore
the extent that human imagination can play a role in computer vision, we want to scientifically understand how
well we can acquire classifiers from the human visual system and leverage them computationally. Hence, we will
evaluate how well we can extract imaginary classifiers by quantifying their ability to discriminate and
recognize objects.


3.1.  Experimental  Setup 

We evaluate our methods on object classification. We assume object localization is given and the task is to
predict the category of each window. We conduct our experiments on the PASCAL VOC 2011 dataset [13], evaluating
against the validation set.^2 We report performance as the average precision on a precision-recall curve. We
show results for two sets of features: HOG [7] and the last convolutional layer (pool5) of a convolutional
neural network (CNN) trained on ImageNet [23, 9]. We use the Felzenszwalb et al. implementation of HOG [15] and
Decaf for extracting CNN features [11]. We trained inversions for both features with paired dictionary learning
[38]. All classification images are estimated on Amazon Mechanical Turk with 150,000 trials.
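
For readers unfamiliar with the metric, the sketch below computes average precision for a ranked list of windows. It uses the simple non-interpolated definition (the mean of the precision measured at the rank of each positive); this is an assumption made for brevity, and the official PASCAL VOC toolkit computes an interpolated variant, so numbers can differ slightly. The scores and labels are invented toy values.

```python
def average_precision(scores, labels):
    """Non-interpolated average precision.

    Sort windows by decreasing classifier score, then average the precision
    observed at the rank of every positive window.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("need at least one positive example")
    return sum(precisions) / len(precisions)


# Toy example: scores could be c^T x for each window's feature vector x.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [True, False, True, False, False]
ap = average_precision(scores, labels)  # (1/1 + 2/3) / 2
print(f"AP = {ap:.4f}")
```

Under this definition a classifier that ranks windows at random scores roughly the fraction of positive windows, which is why rare categories have a very low chance level.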

3.2.  Evaluation 

The results in Fig. 3 suggest that our imaginary classifiers are capturing some signal from the human visual
system. Although the classifiers are estimated using only white noise, in nearly every case the imaginary
classifiers significantly outperform chance. The gain in AP is occasionally large, with performance on person
doubling and television performance increasing by an order of magnitude.^3 These results suggest that we are
able to acquire some signal from the human visual system and start to leverage it computationally.

^2 We added 63 annotated fire hydrants to the dataset for reasons that will become clear later.

^3 Although most researchers walk past a fire hydrant every day, fire hydrants are not annotated in any major
recognition dataset, including ImageNet. However, since imaginary classifiers do not require datasets, we are
still able to recognize them!

Figure 3: We show the average precision (AP) for object classification on PASCAL VOC 2011 using the
classification image. Even though the classification image was created without a dataset, it performs
significantly above chance in nearly every case (green). If a machine learning algorithm were trained without
data, the best it could do is chance.

Moreover, the misclassifications for the imaginary classifiers are often sensible. Fig. 4 shows the class
confusions for the top classifications for each classifier. Notice that cars are frequently confused with other
vehicles, and bottles are commonly confused with people. We hypothesize that future work in building higher
resolution classifiers will resolve some of these issues. The number of noise trials needed to estimate an
imaginary classifier is also feasible, making the method affordable. Fig. 5 shows performance versus the number
of noise trials for a few categories. In many cases, 10,000 positive trials is enough to estimate a classifier.
Performance does not appear to have yet saturated, suggesting that better classifiers can be created with more
trials.

We note that one potential concern in our experiments is that the CNN features are trained to discriminate on
ImageNet [9] LSVRC 2012, and hence had access to data. To address this concern, we have shown results for HOG as
well, which is a hand-crafted feature. Additionally, we showed results for categories that the CNN did not see
during training (people and fire hydrants).

3.3.  Analysis 

Our experiments indicate that imaginary classifiers contain some discriminative power. By analyzing what the
classifier is using for discrimination, we can gain insight into how the human visual system recognizes objects
in computer vision feature spaces.

Our results suggest that shape is important for the imaginary classifier to discriminate in CNN feature space.
Notice how the top classifications in Fig. 6 tend to share the same rough shape by category. For example, the
classifier for person finds people that are upright, and the television classifier fires on rectangular shapes.
The confusions in Fig. 4 support these findings, since bottles are often confused with people, and cars are
confused with buses. Moreover, the visualization of the classifiers in Fig. 2 attempts to show the canonical
shape that the classifier has learned. Although the visualization is blurry, oftentimes one can see strong shape
details, such as the valves appearing in the fire hydrant. Indeed, shape seems to be important.

Figure 4: We plot the class confusions for each imaginary classifier for top classifications for CNN features.
We show only the top 10 classes for visualization. Notice that many of the confusions are semantically
meaningful, e.g. the classifier for car tends to retrieve vehicles, and the fire hydrant classifier commonly
mistakes people and bottles for fire hydrants.

Figure 5: We plot object classification performance on PASCAL VOC with CNN features versus number of positive
noise trials. As the number of trials increases, performance increases as well. Our results suggest that
performance has not yet saturated for many categories. Error bars show standard deviation over 10 random
samples.

Figure 6: We show some of the top classifications from the imaginary classifiers estimated with CNN features.

In addition to shape, some imaginary classifiers appear to rely on color as well. Fig. 6 suggests that the
classifier for fire hydrant correctly favors red objects, which is evidenced by it frequently firing on people
wearing red clothes. The bottle classifier, however, seems to be incorrectly biased towards blue objects, which
contributes to its poor performance. We suspect that the Mechanical Turk workers subconsciously biased the
bottle classifier towards blue. While color is not as important as shape, color appears to be useful for humans
to recognize objects in noise.

Together, these results suggest that the human visual system encodes some bias towards the shape and color of
objects. Since humans are the best object recognition agents, we suspect that this bias is favorable. In the
remainder of this paper, we will explore how we can use these biases from the human visual system.

4. Learning with Imaginary Classifiers

Our experiments suggest that imaginary classifiers provide a good template for the features that the human
visual system finds discriminative for recognition between two classes. Since the human visual system is the
best object recognition system today, we suspect that integrating imaginary classifiers with machine learning
can be powerful. In this section, we present a novel SVM formulation that incorporates knowledge from the human
visual system by constraining the SVM hyperplane to have a similar orientation to the imaginary classifier.

4.1. SVM with Orientation Priors

Figure 7: We visualize the conic constraint on the SVM solution $w$. The feasible space for the solution is the
grayed hypercone. The SVM solution $w$ is not allowed to deviate from $c$ by more than $\cos^{-1}(\theta)$
degrees.

Let $x_i \in \mathbb{R}^d$ be a training point and $y_i \in \{-1, 1\}$ be its label for $1 \le i \le n$. The SVM
seeks a separating hyperplane $w \in \mathbb{R}^d$ with a bias $b \in \mathbb{R}$ that maximizes the margin
between positive and negative examples. We wish to add the constraint that the SVM hyperplane $w$ must be at
most $\cos^{-1}(\theta)$ degrees away from the imaginary classifier $c$:

$$\min_{w, b, \xi} \; \frac{\lambda}{2} w^T w + \sum_{i=1}^{n} \xi_i \qquad (3a)$$

$$\text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \qquad (3b)$$

$$\theta \le \frac{w^T c}{\sqrt{w^T w}} \qquad (3c)$$

where $\xi_i \in \mathbb{R}$ are the slack variables, $\lambda$ is the regularization hyperparameter, and
Eqn. 3c is the orientation prior. Here $\theta \in (0, 1]$ is the smallest cosine allowed between $w$ and $c$,
so $\cos^{-1}(\theta)$ is the maximum angle by which $w$ is allowed to deviate from $c$. Note that we have
assumed, without loss of generality, that $\|c\|_2 = 1$. Fig. 7 shows a visualization of this orientation prior.
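
The orientation prior is easy to check numerically. The sketch below tests whether a candidate $w$ satisfies $\theta \le w^T c / \sqrt{w^T w}$ and converts $\theta$ into the corresponding maximum angle. The function names are assumptions for this example, and $c$ is normalized inside the function so that callers need not enforce $\|c\|_2 = 1$ themselves.

```python
import math


def satisfies_orientation_prior(w, c, theta):
    """True when the cosine of the angle between w and c is at least theta,
    i.e. when w lies inside the cone around c."""
    w_norm = math.sqrt(sum(v * v for v in w))
    c_norm = math.sqrt(sum(v * v for v in c))
    if w_norm == 0.0 or c_norm == 0.0:
        raise ValueError("w and c must be non-zero")
    cos_angle = sum(a * b for a, b in zip(w, c)) / (w_norm * c_norm)
    return cos_angle >= theta


def max_angle_degrees(theta):
    """Largest angle between w and c permitted by a given theta in (0, 1]."""
    return math.degrees(math.acos(theta))


c = [1.0, 0.0]
w = [1.0, 1.0]  # 45 degrees away from c, cosine of about 0.707

print(satisfies_orientation_prior(w, c, theta=0.7))  # True
print(satisfies_orientation_prior(w, c, theta=0.8))  # False
print(max_angle_degrees(1.0))                        # 0.0
```

With $\theta = 1$ the cone collapses onto the direction of $c$, so the only feasible hyperplanes are positive multiples of the imaginary classifier itself.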

4.2.  Learning 

Optimizing Eqn. 3 directly seems to be challenging due to the constraint in Eqn. 3c. However, it is possible to
write the above objective as a conic program. Since conic programs are convex by construction, we can then
optimize it using off-the-shelf solvers.

We rewrite Eqn. 3c as $\sqrt{w^T w} \le \frac{w^T c}{\theta}$ and introduce an auxiliary variable
$\alpha \in \mathbb{R}$ such that $\sqrt{w^T w} \le \alpha \le \frac{w^T c}{\theta}$. Substituting these
constraints into Eqn. 3 and replacing the SVM regularization term with $\frac{\lambda}{2} \alpha^2$ leads to the
conic program:

$$\min_{w, b, \xi, \alpha} \; \frac{\lambda}{2} \alpha^2 + \sum_{i=1}^{n} \xi_i \qquad (4a)$$

$$\text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \qquad (4b)$$

$$\sqrt{w^T w} \le \alpha \qquad (4c)$$

$$\alpha \le \frac{w^T c}{\theta} \qquad (4d)$$

Since the minimum occurs iff $\alpha^2 = w^T w$, Eqn. 4 is equivalent to Eqn. 3, but in a standard conic form.
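
For small problems, the constrained SVM can also be approximated without a conic solver. The sketch below is a swapped-in technique, not the interior point method used in our experiments: it runs projected subgradient descent on the hinge-loss form of Eqn. 3, projecting $w$ back onto the cone $\{w : \theta \|w\| \le w^T c\}$ after every step. The step-size schedule, the iteration count, the value of $\lambda$, and the toy data are all assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_onto_cone(w, c, theta):
    """Euclidean projection of w onto {w : theta * ||w|| <= w^T c}, assuming
    ||c|| = 1 and 0 < theta <= 1."""
    t = dot(w, c)                                 # component along c
    v = [wi - t * ci for wi, ci in zip(w, c)]     # component orthogonal to c
    s = math.sqrt(dot(v, v))
    k = math.sqrt(max(0.0, 1.0 - theta * theta)) / theta  # tan of the half-angle
    if t >= 0.0 and s <= k * t:
        return list(w)                            # already inside the cone
    if k * s <= -t:
        return [0.0] * len(w)                     # closest feasible point is the apex
    alpha = (k * s + t) / (1.0 + k * k)           # new component along c
    scale = alpha * k / s                         # shrink the orthogonal part
    return [alpha * ci + scale * vi for ci, vi in zip(c, v)]


def train_cone_svm(xs, ys, c, theta, lam=0.1, steps=2000):
    """Projected subgradient descent on
    (lam / 2) * w^T w + sum_i max(0, 1 - y_i (w^T x_i + b))
    subject to the orientation constraint of Eqn. 3c."""
    c_norm = math.sqrt(dot(c, c))
    c = [ci / c_norm for ci in c]
    w, b = list(c), 0.0                           # c itself is always feasible
    for step in range(1, steps + 1):
        lr = 0.1 / math.sqrt(step)                # hypothetical step-size schedule
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(xs, ys):
            if y * (dot(w, x) + b) < 1.0:         # margin violated: hinge is active
                grad_w = [g - y * xi for g, xi in zip(grad_w, x)]
                grad_b -= y
        w = project_onto_cone([wi - lr * g for wi, g in zip(w, grad_w)], c, theta)
        b -= lr * grad_b
    return w, b


# Toy 2-D data with the imaginary classifier pointing along the first axis.
xs = [[2.0, 1.0], [3.0, 2.0], [-2.0, -1.0], [-3.0, -1.5]]
ys = [1, 1, -1, -1]
c = [1.0, 0.0]
theta = 0.9

w, b = train_cone_svm(xs, ys, c, theta)
cosine = dot(w, c) / math.sqrt(dot(w, w))
predictions = [1 if dot(w, x) + b > 0 else -1 for x in xs]
print(w, b, cosine, predictions)
```

Because every iterate is projected, the returned $w$ always satisfies the orientation prior, although the method converges far more slowly than a dedicated conic solver and gives only an approximate minimizer.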
Car Classification (CNN, train on PASCAL, test on PASCAL)

Figure 8: We plot the influence of $\theta$ on car classification on PASCAL 2011 with CNN features. When there
is only one positive training example available, the imaginary classifier alone ($\theta = 1$) obtains better
performance (blue). However, when there is a medium amount of training data (5 to 100 positives), gently
incorporating the imaginary classifier ($\theta \approx 0.7$) into an SVM boosts performance (green, red). Error
bars show standard deviation over 5 random trials.


We optimize Eqn. 4 using an interior point method. In our experiments, we use MOSEK [1]. Optimization took an
hour on typical sized problems, but since we use a general purpose solver, improving the implementation should
significantly reduce run time. We note that a similar SVM formulation was introduced in [12], but it imposes
only sign constraints, not orientation constraints, on the weight vector. Moreover, observe that removing
Eqn. 4d makes it equivalent to the standard SVM.

$\cos^{-1}(\theta)$ specifies the angle of the cone. In our experiments, we found 30° to be reasonable. While
this angle is not very restrictive in low dimensions, it becomes much more restrictive as the number of
dimensions increases. The probability of a randomly sampled $w$ satisfying the rotation constraint can be
determined by calculating the surface area of the spherical cap formed with the cone, then dividing it by the
surface area of the whole sphere [25]. The probability of a random $w$ satisfying the angle constraint for 20° in
3D  is  0.03,  but  drops  to  0(10“^^)  in  100  dimensions. 
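
The spherical cap calculation can be reproduced numerically. For a direction drawn uniformly on the unit sphere in $d$ dimensions, the angle $t$ to a fixed axis has density proportional to $\sin^{d-2}(t)$, so the cap fraction is the ratio of $\int_0^{\varphi} \sin^{d-2}(t)\,dt$ to $\int_0^{\pi} \sin^{d-2}(t)\,dt$. The sketch below evaluates both integrals with Simpson's rule; the function name and the number of integration steps are assumptions for this example.

```python
import math


def cap_fraction(angle_degrees, dims, steps=20000):
    """Fraction of the unit sphere in `dims` dimensions that lies within
    `angle_degrees` of a fixed direction."""
    if dims < 2 or not 0.0 <= angle_degrees <= 180.0:
        raise ValueError("need dims >= 2 and an angle in [0, 180]")
    if steps % 2:
        steps += 1                      # Simpson's rule needs an even count
    power = dims - 2

    def integrand(t):
        s = math.sin(t)
        if s <= 0.0:
            return 1.0 if power == 0 else 0.0
        log_value = power * math.log(s)
        # Treat vanishingly small terms as zero to avoid underflow issues.
        return math.exp(log_value) if log_value > -700.0 else 0.0

    def integral(upper):
        h = upper / steps
        total = integrand(0.0) + integrand(upper)
        for i in range(1, steps):
            total += (4.0 if i % 2 else 2.0) * integrand(i * h)
        return total * h / 3.0

    return integral(math.radians(angle_degrees)) / integral(math.pi)


fraction_3d = cap_fraction(20.0, 3)
fraction_100d = cap_fraction(20.0, 100)
print(f"3-D:   {fraction_3d:.4f}")
print(f"100-D: {fraction_100d:.3e}")
```

In three dimensions the ratio reduces to $(1 - \cos 20°)/2 \approx 0.03$, which agrees with the value quoted above.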

4.3.  Transferring  Human  Bias  into  Recognition 

Since we believe the bias in the human visual system is favorable, we are interested in transferring this bias
into object recognition. To accomplish this, we can train an SVM and impose the imaginary classifier as an
orientation prior.

Using the same evaluation procedure as the previous section, we compare three approaches: 1) a single SVM
trained with only a few positives and the entire negative set, 2) the same SVM with orientation priors on the
imaginary classifier, and 3) the imaginary classifier alone. We then follow the same experimental setup as
before. We show performance on car classification with CNN features in Fig. 8 for varying $\theta$ to see the
influence of the imaginary classifier on the SVM. Note that with one positive training example (blue curve), the
imaginary classifier still provides the best results, suggesting that human bias is more valuable than a single
real image.

When we train the standard SVM with five positive examples,
the SVM beats the imaginary classifier alone. However,
by incorporating an imaginary classifier as an orientation
prior into the learning (green curve), the SVM is forced
to find a solution that fits the data while agreeing with the
human visual system, beating the other two approaches by
nearly 5% AP. These results suggest that transferring the
human bias into machine learning methods can improve object
recognition performance. Finally, training on the entire
dataset (purple curve) gives the best results overall. This is
to be expected, since noise is no substitute for large amounts
of annotated data. However, in the absence of big data, our
results suggest that extracting knowledge from the human
visual system can be powerful.

We show full results for the SVM with orientation priors
in Fig. 9. In general, imaginary classifiers are able to
assist the SVM when only a few positive training examples
are available. In these low-data regimes, acquiring
classifiers from the human visual system can improve
performance by significant margins, sometimes 10% AP.
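
Since the comparisons above are reported in AP, a brief sketch of the metric may help. This is a simplified, non-interpolated average precision and is an assumption for illustration; the PASCAL VOC evaluation defines its own AP protocol, which may differ in detail. The function name average_precision is hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of the precision at the rank of each positive.

    scores: classifier outputs, higher means more confident.
    labels: 1 for positive examples, 0 for negatives.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

Under this definition, an improvement of 5% AP corresponds to a 0.05 increase in the returned value.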

4.4.  Dataset  Generalization 

Several recent papers have reported that standard computer
vision datasets suffer from dataset biases that harm
cross-dataset generalization performance [35, 30].
Unfortunately, there is no known method to fix this (although
there have been several good first attempts [22, 33, 19]).
Since the imaginary classifiers are immune to dataset bias
(there is no dataset) and instead inherit human biases, our
approach can offer some relief.

We trained an SVM classifier with CNN features to recognize
cars on Caltech 101 [14], but we tested it on object
classification with PASCAL VOC 2011. Fig. 10a suggests
that, by constraining the SVM to be close to the imaginary
classifier, we are able to improve the generalization
performance of our classifiers, sometimes by over 5% AP. We
then tried the reverse experiment in Fig. 10b: we trained on
PASCAL VOC 2011, but tested on Caltech 101. While PASCAL
VOC provides a much better sample of the visual world, the
orientation priors still help generalization performance
when there is little training data available. These results
suggest that incorporating the biases from the human visual
system can help alleviate some dataset bias issues in
computer vision.

5.  Bias  in  the  Human  Visual  System 

We have so far examined classifiers acquired from an
international population, and our results suggest that there
is a bias from the human visual system that influences the
mental images that people imagine. However, not everyone
necessarily shares the same bias.

Figure 9: We show AP for the SVM with orientation priors
for object classification on PASCAL VOC 2011 for varying
amounts of positive data with CNN features. All results are
means over random subsamples of the training sets. SVM+Hum
refers to the SVM with the imaginary classifier as an
orientation prior. Green indicates an improvement of at
least 1%.

Figure 10: Since the imaginary classifier is estimated only by humans looking at noise, it tends to be biased towards the
human visual system, and can alleviate some problems in dataset bias. (a) Train on Caltech 101, test on PASCAL: we train
an SVM to classify cars on Caltech 101 that is constrained towards the imaginary classifier, and evaluate it on PASCAL VOC
2011. For every training set size, constraining the SVM to the imaginary classifier with θ = 0.75 is able to improve
generalization performance. (b) Train on PASCAL, test on Caltech 101: we train a constrained SVM on PASCAL VOC 2011 and
test on Caltech 101. For low data regimes, the imaginary classifier is again able to boost performance.

We  found  that  people  from  India  and  the  United  States 
may  have  different  mental  images  of  sports  balls.  We  in¬ 
structed  workers  on  Mechanical  Turk  to  find  “sport  balls” 
in  CNN  noise,  and  clustered  workers  by  their  geographic 
location.  Fig.  11  shows  the  imaginary  classifiers  for  both 
India  and  the  United  States.  Even  though  both  sets  of  work¬ 
ers  were  labeling  noise  from  the  same  distribution,  Indian 
workers  seemed  to  imagine  red  balls,  while  American  work¬ 
ers  tended  to  imagine  orange/brown  balls.  Remarkably, 
the  most  popular  sport  in  India  is  cricket,  which  is  played 
with  a  red  ball,  and  popular  sports  in  the  United  States  are 
American  football  and  basketball,  which  are  played  with 
brown/orange balls. We hypothesize that Americans and
Indians may have different mental images of sports balls in
their heads and that the color is influenced by popular
sports in their country. This effect is likely attributable
to phenomena in social psychology where human perception can
be influenced by culture [6, 4]. Since environment plays a
role in the development of the human visual system, people
from different cultures likely develop slightly different
images inside their heads.
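
The per-country grouping can be sketched as follows. This is a minimal illustration under an assumption: each imaginary classifier is taken to be the mean of the noise vectors a group of workers accepted minus the mean of those they rejected, a common classification-image estimate that this section does not spell out. The record layout (country, noise_vector, said_yes) and the function names are hypothetical.

```python
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classifiers_by_country(responses):
    """Estimate one classifier per country from worker responses.

    responses: iterable of (country, noise_vector, said_yes) tuples.
    Returns {country: mean(accepted noise) - mean(rejected noise)}.
    """
    groups = defaultdict(lambda: {True: [], False: []})
    for country, noise, said_yes in responses:
        groups[country][bool(said_yes)].append(noise)
    result = {}
    for country, split in groups.items():
        if not split[True] or not split[False]:
            continue  # both kinds of response are needed to form a difference
        accepted = mean_vector(split[True])
        rejected = mean_vector(split[False])
        result[country] = [p - q for p, q in zip(accepted, rejected)]
    return result
```

Because every group labels noise drawn from the same distribution, any systematic difference between the resulting vectors reflects the workers' responses rather than the stimuli.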

This effect can be observed on more categories, sometimes
manifesting in subtle ways. We created a classifier
for each country, but this time asked workers to find cars in
CNN noise. Fig. 12 shows the distribution of top poses that
each car imaginary classifier finds. Surprisingly, the
American imaginary classifier favors left-right facing cars,
while the Singaporean one favors front-back views of cars.
This result suggests that people may have different biases
in their human visual system.

6.  Discussion 

While the ideas in this paper may seem unconventional,
they highlight how human imagination can be a rich resource
for computer vision systems. Humans are able to
imagine objects under any transformation, even for concepts
never before seen. However, creating intelligent vision
machines with the capability to imagine radically new
concepts never before encountered in their data remains a
significant, open research problem in our field. This paper
explores this direction by showing that it is possible to
transfer classifiers extracted from human imagination into a
machine with modest success. Our hope is that our ideas will
inspire future work on building machines with the ability to
imagine new visual concepts just like a human.

Figure 11: By instructing workers to classify CNN noise
as a sports ball or not, then creating imaginary classifiers
by country, (a) India and (b) United States, we reveal the
different mental images of sports balls (the red/orange
circles in the center) that people from different countries
have inside their heads. Indians seem to imagine a red ball,
which is the standard color for a cricket ball, cricket being
the predominant sport in India. Americans seem to imagine a
brown or orange ball, which could be an American football or
basketball, both popular sports in the U.S.

Figure 12: Car pose ranking by country. We created an imaginary classifier for each country, and examine the distribution
of poses that each classification image favors. Notice that US workers seem more likely to imagine left/right facing cars,
while Singapore workers may favor imagining front/back facing cars. Please see text for details.